On the Power of Adaptivity in Sparse Recovery 



Piotr Indyk Eric Price David P. Woodruff 

October 19, 2011 

Abstract 

The goal of (stable) sparse recovery is to recover a $k$-sparse approximation $x^*$ of a vector $x$ from
linear measurements of $x$. Specifically, the goal is to recover $x^*$ such that

\[ \|x - x^*\|_p \le C \min_{k\text{-sparse } x'} \|x - x'\|_q \]

for some constant $C$ and norm parameters $p$ and $q$. It is known that, for $p = q = 1$ or $p = q = 2$, this
task can be accomplished using $m = O(k \log(n/k))$ non-adaptive measurements [CRT06] and that this
bound is tight [DIPW10, FPRU10, PW11].

In this paper we show that if one is allowed to perform measurements that are adaptive, then the
number of measurements can be considerably reduced. Specifically, for $C = 1 + \epsilon$ and $p = q = 2$ we
show

• A scheme with $m = O(\frac{1}{\epsilon} k \log\log(n\epsilon/k))$ measurements that uses $O(\log^* k \cdot \log\log(n\epsilon/k))$
rounds. This is a significant improvement over the best possible non-adaptive bound.

• A scheme with $m = O(\frac{1}{\epsilon} k \log(k/\epsilon) + k \log(n/k))$ measurements that uses two rounds. This
improves over the best possible non-adaptive bound.

To the best of our knowledge, these are the first results of this type. 

As an independent application, we show how to solve the problem of finding a duplicate in a data
stream of $n$ items drawn from $\{1, 2, \ldots, n-1\}$ using $O(\log n)$ bits of space and $O(\log\log n)$ passes,
improving over the best possible space complexity achievable using a single pass.



1 Introduction 



In recent years, a new "linear" approach for obtaining a succinct approximate representation of n-dimensional 
vectors (or signals) has been discovered. For any signal x, the representation is equal to Ax, where A is an 
$m \times n$ matrix, or possibly a random variable chosen from some distribution over such matrices. The vector
Ax is often referred to as the measurement vector or linear sketch of x. Although m is typically much 
smaller than n, the sketch Ax often contains plenty of useful information about the signal x. 

A particularly useful and well-studied problem is that of stable sparse recovery. We say that a vector
$x'$ is $k$-sparse if it has at most $k$ non-zero coordinates. The sparse recovery problem is typically defined
as follows: for some norm parameters $p$ and $q$ and an approximation factor $C > 0$, given $Ax$, recover an
"approximation" vector $x^*$ such that

\[ \|x - x^*\|_p \le C \min_{k\text{-sparse } x'} \|x - x'\|_q \qquad (1) \]

(this inequality is often referred to as the $\ell_p/\ell_q$ guarantee). If the matrix $A$ is random, then Equation (1)
should hold for each $x$ with some probability (say, 2/3). Sparse recovery has a tremendous number of
applications in areas such as compressive sensing of signals [CRT06, Don06], genetic data acquisition and
analysis [SAZ10, BGK+10], and data stream algorithms^1 [Mut05, Ind07]; the latter includes applications to
network monitoring and data analysis.

It is known [CRT06] that there exist matrices $A$ and associated recovery algorithms that produce
approximations $x^*$ satisfying Equation (1) with $p = q = 1$, constant approximation factor $C$, and sketch
length

\[ m = O(k \log(n/k)) \qquad (2) \]

A similar bound, albeit using random matrices $A$, was later obtained for $p = q = 2$ [GLPS10] (building
on [CCF02, CM04, CM06]). Specifically, for $C = 1 + \epsilon$, they provide a distribution over matrices $A$ with

\[ m = O\left(\frac{1}{\epsilon} k \log(n/k)\right) \qquad (3) \]

rows, together with an associated recovery algorithm.

It is also known that the bound in Equation (2) is asymptotically optimal for some constant $C$ and
$p = q = 1$; see [DIPW10] and [FPRU10] (building on [GG84, Glu84, Kas77]). The bound of [DIPW10]
also extends to the randomized case and $p = q = 2$. For $C = 1 + \epsilon$, a lower bound of $m = \Omega(\frac{1}{\epsilon} k \log(n/k))$
was recently shown [PW11] for the randomized case and $p = q = 2$, improving upon the earlier work of
[DIPW10] and showing the dependence on $\epsilon$ is optimal. The necessity of the "extra" logarithmic factor
multiplying $k$ is quite unfortunate: the sketch length determines the "compression rate", and for large $n$ any
logarithmic factor can worsen that rate tenfold.

In this paper we show that this extra factor can be greatly reduced if we allow the measurement process
to be adaptive. In the adaptive case, the measurements are chosen in rounds, and the choice of the
measurements in each round depends on the outcome of the measurements in the previous rounds. The adaptive
measurement model has received a fair amount of attention in recent years [JXC08, CHNR08, HCN09,
HBCN09, MSW08, AWZ08]; see also [Def10]. In particular, [HBCN09] showed that adaptivity helps
reduce the approximation error in the presence of random noise. However, no asymptotic improvement to
the number of measurements needed for sparse recovery (as in Equation (1)) was previously known.



^1 In streaming applications, a data stream is modeled as a sequence of linear operations on an (implicit) vector $x$. Example
operations include increments or decrements of $x$'s coordinates. Since such operations can be directly performed on the linear
sketch $Ax$, one can maintain the sketch using only $O(m)$ words.

Results. In this paper we show that adaptivity can lead to very significant improvements in the number
of measurements over the bounds in Equations (2) and (3). We consider randomized sparse recovery with
the $\ell_2/\ell_2$ guarantee, and show two results:

1. A scheme with $m = O(\frac{1}{\epsilon} k \log\log(n\epsilon/k))$ measurements and an approximation factor $C = 1 + \epsilon$. For
low values of $k$ this provides an exponential improvement over the best possible non-adaptive bound.
The scheme uses $O(\log^* k \cdot \log\log(n\epsilon/k))$ rounds.

2. A scheme with $m = O(\frac{1}{\epsilon} k \log(k/\epsilon) + k \log(n/k))$ measurements and an approximation factor $C = 1 + \epsilon$. For
low values of $k$ and $\epsilon$ this offers a significant improvement over the best possible non-adaptive bound,
since the dependence on $n$ and $\epsilon$ is "split" between two terms. The scheme uses only two rounds.

Implications. Our new bounds lead to potentially significant improvements to the efficiency of sparse recovery
schemes in a number of application domains. Naturally, not all applications support adaptive measurements.
For example, network monitoring requires the measurements to be performed simultaneously, since we
cannot ask the network to "re-run" the packets all over again. However, a surprising number of applications
are capable of supporting adaptivity. For example:

• Streaming algorithms for data analysis: since each measurement round can be implemented by one
pass over the data, adaptive schemes simply correspond to multiple-pass streaming algorithms (see
[McG09] for some examples of such algorithms).

• Compressed sensing of signals: several architectures for compressive sensing, e.g., the single-pixel
camera of [DDT+08], already perform the measurements in a sequential manner. In such cases the
measurements can be made adaptive.^2 Other architectures supporting adaptivity are under
development [Def10].

• Genetic data analysis and acquisition: as above.

Therefore, it seems likely that the results in this paper will be applicable in a wide variety of scenarios.

As an example application, we show how to solve the problem of finding a duplicate in a data stream of
$n$ arbitrarily chosen items from the set $\{1, 2, \ldots, n-1\}$ presented in an arbitrary order. Our algorithm uses
$O(\log n)$ bits of space and $O(\log\log n)$ passes. It is known that for a single pass, $\Theta(\log^2 n)$ bits of space
is necessary and sufficient [JST11], and so our algorithm improves upon the best possible space complexity
using a single pass.

Techniques. At a high level, both of our schemes follow the same two-step process. First, we reduce the
problem of finding the best $k$-sparse approximation to the problem of finding the best 1-sparse approximation
(using relatively standard techniques). This is followed by solving the latter (simpler) problem.

The first scheme starts by "isolating" most of the large coefficients by randomly sampling an $\approx \epsilon/k$
fraction of the coordinates; this mostly follows the approach of [GLPS10] (cf. [GGI+02]). The crux of the
algorithm is in the identification of the isolated coefficients. Note that in order to accomplish this using
$O(\log\log n)$ measurements (as opposed to the $O(\log n)$ achieved by the "standard" binary search algorithm)
we need to "extract" significantly more than one bit of information per measurement. To achieve this, we
proceed as follows. First, observe that if the given vector (say, $z$) is exactly 1-sparse, then one can extract
the position of the non-zero entry (say $z_j$) from two measurements $a(z) = \sum_i z_i$ and $b(z) = \sum_i i z_i$,
since $b(z)/a(z) = j$. A similar algorithm works even if $z$ contains some "very small" non-zero entries:
we just round $b(z)/a(z)$ to the nearest integer. This algorithm is a special case of a general algorithm that
achieves $O(\log n / \log \mathrm{SNR})$ measurements to identify a single coordinate $x_j$ among $n$ coordinates, where
$\mathrm{SNR} = x_j^2 / \|x_{[n] \setminus \{j\}}\|_2^2$ (SNR stands for signal-to-noise ratio). This is optimal as a function of $n$ and the
SNR [DIPW10].

^2 We note that, in any realistic sensing system, minimizing the number of measurements is only one of several considerations.
Other factors include: minimizing the computation time, minimizing the amount of communication needed to transfer the
measurement matrices to the sensor, satisfying constraints on the measurement matrix imposed by the hardware, etc. A detailed cost
analysis covering all of these factors is architecture-specific, and beyond the scope of this paper.

A natural approach would then be to partition $[n]$ into two sets $\{1, \ldots, n/2\}$ and $\{n/2 + 1, \ldots, n\}$, find
the heavier of the two sets, and recurse. This would take $O(\log n)$ rounds. The key observation is that
not only do we recurse on a smaller-sized set of coordinates, but the SNR has also increased since $|x_j|$ has
remained the same but the squared norm of the tail has dropped by a constant factor. Therefore in the
next round we can afford to partition our set into more than two sets, since as long as we keep the ratio of
log(# of sets) and log SNR constant, we only need $O(1)$ measurements per round. This ultimately leads
to a scheme that finishes after $O(\log\log n)$ rounds.
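
The two-measurement identification of an (almost) 1-sparse vector described above can be illustrated with a minimal sketch. The function name `locate_single_spike` and the toy vector are assumptions made for illustration; positions are 1-based, as in the text.

```python
def locate_single_spike(z):
    """Return the 1-based position of the dominant entry of z from two linear measurements."""
    a = sum(z)                                           # a(z) = sum_i z_i
    b = sum(i * zi for i, zi in enumerate(z, start=1))   # b(z) = sum_i i * z_i
    return round(b / a)                                  # b/a = j when z is exactly 1-sparse


z = [0.0] * 16
z[10] = 3.0                        # the spike sits at 1-based position 11
exact = locate_single_spike(z)     # exactly 1-sparse case

z[2] = 1e-3                        # a "very small" extra entry only perturbs b/a slightly
noisy = locate_single_spike(z)     # rounding absorbs the perturbation
```

In this toy instance both calls return position 11. The sketch ignores the random signs used in Lemma 3.1, which guard against cancellation in $a(z)$.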

In the second scheme, we start by hashing the coordinates into a universe of size polynomial in $k$ and
$1/\epsilon$, in a way that approximately preserves the top coefficients without introducing spurious ones, and in
such a way that the mass of the tail of the vector does not increase significantly by hashing. This idea is
inspired by techniques in the data stream literature for estimating moments [KNPW10, TZ04] (cf. [CCF02,
CM06, GI10]). Here, though, we need stronger error bounds. This enables us to identify the positions of
those coefficients (in the hashed space) using only $O(\frac{1}{\epsilon} k \log(k/\epsilon))$ measurements. Once this is done, for
each large coefficient $i$ in the hash space, we identify the actual large coefficient in the preimage of $i$. This
can be achieved using a number of measurements that does not depend on $\epsilon$.



2 Preliminaries 

We start from a few definitions. Let $x$ be an $n$-dimensional vector.

Definition 2.1. Define

\[ H_k(x) = \operatorname*{arg\,max}_{S \subseteq [n],\ |S| = k} \|x_S\|_2 \]

to be the largest $k$ coefficients in $x$.

Definition 2.2. For any vector $x$, we define the "heavy hitters" to be those elements that are both (i) in the
top $k$ and (ii) large relative to the mass outside the top $k$. We define

\[ H_{k,\epsilon}(x) = \{ j \in H_k(x) \mid x_j^2 \ge \epsilon \|x_{\overline{H_k(x)}}\|_2^2 \}. \]

Definition 2.3. Define the error

\[ \mathrm{Err}^2(x, k) = \|x_{\overline{H_k(x)}}\|_2^2. \]



For the sake of clarity, the analysis of the algorithm in Section 4 assumes that the entries of $x$ are sorted
by absolute value (i.e., we have $|x_1| \ge |x_2| \ge \cdots \ge |x_n|$). In this case, the set $H_k(x)$ is equal to $[k]$;
this allows us to simplify the notation and avoid double subscripts. The algorithms themselves are invariant
under permutation of the coordinates of $x$.
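
To make the definitions of this section concrete, here is a small sketch that computes the top-$k$ set, the tail error and the heavy hitters for an explicit vector. It assumes the reading $H_{k,\epsilon}(x) = \{j \in H_k(x) : x_j^2 \ge \epsilon \cdot \mathrm{Err}^2(x,k)\}$ with $\mathrm{Err}^2(x,k)$ the squared $\ell_2$ mass outside $H_k(x)$; the function names, the 0-based indexing and the tie-breaking rule are illustrative choices, not part of the paper.

```python
def top_k(x, k):
    """H_k(x): indices (0-based) of the k largest-magnitude coordinates, ties broken by index."""
    order = sorted(range(len(x)), key=lambda i: (-abs(x[i]), i))
    return set(order[:k])


def err2(x, k):
    """Err^2(x, k): squared l2 mass of x outside H_k(x)."""
    head = top_k(x, k)
    return sum(v * v for i, v in enumerate(x) if i not in head)


def heavy_hitters(x, k, eps):
    """H_{k,eps}(x): members of H_k(x) whose square is at least eps * Err^2(x, k)."""
    tail = err2(x, k)
    return {j for j in top_k(x, k) if x[j] ** 2 >= eps * tail}


x = [10.0, -6.0, 0.5, 3.0, 1.0, 1.0]
head = top_k(x, 2)                  # the two largest coordinates
tail_mass = err2(x, 2)              # 0.25 + 9 + 1 + 1
strict = heavy_hitters(x, 2, 4.0)   # threshold 4 * 11.25 = 45
loose = heavy_hitters(x, 2, 1.0)    # threshold 11.25
```

With the stricter threshold only the coordinate of value 10 qualifies, which shows that a top-$k$ coordinate need not be a heavy hitter.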



Running times of the recovery algorithms. In the non-adaptive model, the running time of the recovery
algorithm is well-defined: it is the number of operations performed by a procedure that takes $Ax$ as its
input and produces an approximation $x^*$ to $x$. The time needed to generate the measurement vectors $A$, or
to encode the vector $x$ using $A$, is not included. In the adaptive case, the distinction between the matrix
generation, encoding and recovery procedures does not exist, since new measurements are generated based
on the values of the prior ones. Moreover, the running time of the measurement generation procedure heavily
depends on the representation of the matrix. If we suppose that we may output the matrix in sparse form
and receive encodings in time bounded by the number of non-zero entries in the matrix, our algorithms run
in $n \log^{O(1)} n$ time.



3 Full adaptivity 

This section shows how to perform $k$-sparse recovery with $O(k \log\log(n/k))$ measurements. The core
of our algorithm is a method for performing 1-sparse recovery with $O(\log\log n)$ measurements. We then
extend this to $k$-sparse recovery via repeated subsampling.



3.1 1-sparse recovery 



This section discusses recovery of 1-sparse vectors with $O(\log\log n)$ adaptive measurements. First, in
Lemma 3.1 we show that if the heavy hitter $x_j$ is $\Omega(n)$ times larger than the $\ell_2$ error ($x_j$ is "$\Omega(n)$-heavy"),
we can find it with two non-adaptive measurements. This corresponds to non-adaptive 1-sparse recovery
with approximation factor $C = \Theta(n)$; achieving this with $O(1)$ measurements is unsurprising, because the
lower bound [DIPW10] is $\Omega(\log_{1+C} n)$.

Lemma 3.1 is not directly very useful, since $x_j$ is unlikely to be that large. However, if $x_j$ is $D$ times
larger than everything else, we can partition the coordinates of $x$ into $D$ random blocks of size $n/D$ and
perform dimensionality reduction on each block. The result will in expectation be a vector of size $D$ where
the block containing $j$ is $D$ times larger than anything else. The first lemma applies, so we can recover the
block containing $j$, which has a $1/\sqrt{D}$ fraction of the $\ell_2$ noise. Lemma 3.2 gives this result.

We then have that with two non-adaptive measurements of a $D$-heavy hitter we can restrict to a subset
where it is an $\Omega(D^{3/2})$-heavy hitter. Iterating $\log\log n$ times gives the result, as shown in Lemma 3.3.



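Lemma 3.1. Suppose there exists a $j$ with $|x_j| \ge C \frac{n}{\sqrt{\delta}} \|x_{[n] \setminus \{j\}}\|_2$ for some constant $C$. Then two
non-adaptive measurements suffice to recover $j$ with probability $1 - \delta$.

Proof. Let $s: [n] \to \{\pm 1\}$ be chosen from a 2-wise independent hash family. Perform the measurements
$a(x) = \sum_i s(i) x_i$ and $b(x) = \sum_i (n + i) s(i) x_i$. For recovery, output the closest integer to $b/a - n$.

Let $z = x_{[n] \setminus \{j\}}$. Then $E[a(z)^2] = \|z\|_2^2$ and $E[b(z)^2] \le 4n^2 \|z\|_2^2$. Hence with probability at least
$1 - 2\delta$, we have both

\[ |a(z)| \le \sqrt{1/\delta}\, \|z\|_2 \]
\[ |b(z)| \le 2n \sqrt{1/\delta}\, \|z\|_2. \]

Thus

\[ \frac{b(x)}{a(x)} = \frac{s(j)(n+j)x_j + b(z)}{s(j)x_j + a(z)} \]

\[ \left| \frac{b(x)}{a(x)} - (n+j) \right| = \left| \frac{b(z) - (n+j)a(z)}{s(j)x_j + a(z)} \right| \le \frac{|b(z)| + (n+j)|a(z)|}{\big| |x_j| - |a(z)| \big|} \le \frac{4n\sqrt{1/\delta}\, \|z\|_2}{\big| |x_j| - |a(z)| \big|}. \]

Suppose $|x_j| > (8n + 1)\sqrt{1/\delta}\, \|z\|_2$. Then

\[ \left| \frac{b(x)}{a(x)} - (n+j) \right| < \frac{4n\sqrt{1/\delta}\, \|z\|_2}{8n\sqrt{1/\delta}\, \|z\|_2} = 1/2 \]

so the output equals $j$. □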



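Lemma 3.2. Suppose there exists a $j$ with $|x_j| \ge C \frac{B^2}{\delta^2} \|x_{[n] \setminus \{j\}}\|_2$ for some constant $C$ and parameters $B$
and $\delta$. Then with two non-adaptive measurements, with probability $1 - \delta$ we can find a set $S \subset [n]$ such
that $j \in S$, $\|x_{S \setminus \{j\}}\|_2 \le \|x_{[n] \setminus \{j\}}\|_2 / B$, and $|S| \le 1 + n/B^2$.

Proof. Let $D = B^2/\delta$, and let $h: [n] \to [D]$ and $s: [n] \to \{\pm 1\}$ be chosen from pairwise independent
hash families. Then define $S_p = \{i \in [n] \mid h(i) = p\}$. Define the matrix $A \in \mathbb{R}^{D \times n}$ by $A_{h(i),i} = s(i)$ and
$A_{p,i} = 0$ elsewhere. Then

\[ (Az)_p = \sum_{i \in S_p} s(i) z_i. \]

Let $p^* = h(j)$ and $y = x_{[n] \setminus \{j\}}$. We have that

\[ E[|S_{p^*}|] = 1 + (n-1)/D \]
\[ E[(Ay)_{p^*}^2] = E[\|y_{S_{p^*}}\|_2^2] = \|y\|_2^2 / D \]
\[ E[\|Ay\|_2^2] = \|y\|_2^2. \]

Hence by Chebyshev's inequality, with probability at least $1 - 4\delta$ all of the following hold:

\[ |S_{p^*}| \le 1 + (n-1)/(D\delta) \le 1 + n/B^2 \qquad (4) \]
\[ \|y_{S_{p^*}}\|_2 \le \|y\|_2 / \sqrt{D\delta} \qquad (5) \]
\[ |(Ay)_{p^*}| \le \|y\|_2 / \sqrt{D\delta} \qquad (6) \]
\[ \|Ay\|_2 \le \|y\|_2 / \sqrt{\delta}. \qquad (7) \]

The combination of (6) and (7) implies

\[ |(Ax)_{p^*}| \ge |x_j| - |(Ay)_{p^*}| \ge \left(\frac{CD}{\delta} - \frac{1}{\sqrt{D\delta}}\right) \|y\|_2 \ge \left(\frac{CD}{\delta} - \frac{1}{\sqrt{D\delta}}\right) \sqrt{\delta}\, \|Ay\|_2 \ge \frac{CD}{2\sqrt{\delta}} \|Ay\|_2 \]

and hence

\[ |(Ax)_{p^*}| \ge \frac{CD}{2\sqrt{\delta}} \|(Ax)_{[D] \setminus p^*}\|_2. \]

As long as $C/2$ is larger than the constant in Lemma 3.1, this means two non-adaptive measurements suffice
to recover $p^*$ with probability $1 - \delta$. We then output the set $S_{p^*}$, which by (5) has

\[ \|x_{S_{p^*} \setminus \{j\}}\|_2 = \|y_{S_{p^*}}\|_2 \le \|y\|_2 / \sqrt{D\delta} = \|x_{[n] \setminus \{j\}}\|_2 / \sqrt{D\delta} = \|x_{[n] \setminus \{j\}}\|_2 / B \]

as desired. The overall failure probability is at most $5\delta$; rescaling $\delta$ and $C$ gives the result. □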



Lemma 3.3. Suppose there exists a $j$ with $|x_j| \ge C \|x_{[n] \setminus \{j\}}\|_2$ for some constant $C$. Then $O(\log\log n)$
adaptive measurements suffice to recover $j$ with probability 1/2.

procedure NonAdaptiveShrink($x$, $D$)        ▷ Find smaller set $S$ containing heavy coordinate $x_j$
    For $i \in [n]$, $s_1(i) \leftarrow \{\pm 1\}$, $h(i) \leftarrow [D]$
    For $i \in [D]$, $s_2(i) \leftarrow \{\pm 1\}$
    $a \leftarrow \sum_i s_1(i) s_2(h(i)) x_i$        ▷ Observation
    $b \leftarrow \sum_i s_1(i) s_2(h(i)) x_i (D + h(i))$        ▷ Observation
    $p^* \leftarrow$ Round($b/a - D$)
    return $\{j^* \mid h(j^*) = p^*\}$
end procedure

procedure AdaptiveOneSparseRec($x$)        ▷ Recover heavy coordinate $x_j$
    $S \leftarrow [n]$
    $B \leftarrow 2$, $\delta \leftarrow 1/4$
    while $|S| > 1$ do
        $S \leftarrow$ NonAdaptiveShrink($x_S$, $4B^2/\delta$)
        $B \leftarrow B^{3/2}$, $\delta \leftarrow \delta/2$
    end while
    return $S[0]$
end procedure

Algorithm 3.1: Adaptive 1-sparse recovery
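
The pseudocode of Algorithm 3.1 can be turned into a short simulation. The following is only a sketch under stated assumptions: it reads the whole vector `x` directly instead of issuing measurements to a sensor, uses Python's fully random generator in place of limited-independence hash families, and adds three safeguards that are not in the paper (a cap on `D`, a cap on `B`, and a bound `max_rounds` on the number of rounds). The function names are illustrative.

```python
import math
import random


def non_adaptive_shrink(x, S, D, rng):
    """One round: two linear measurements of x restricted to S, hashed into D buckets."""
    h = {i: rng.randrange(D) for i in S}          # bucket h(i) of each coordinate
    s1 = {i: rng.choice((-1, 1)) for i in S}      # random sign per coordinate
    s2 = {}                                       # random sign per occupied bucket
    for i in S:
        s2.setdefault(h[i], rng.choice((-1, 1)))
    a = sum(s1[i] * s2[h[i]] * x[i] for i in S)                # first observation
    b = sum(s1[i] * s2[h[i]] * x[i] * (D + h[i]) for i in S)   # second observation
    if a == 0:
        return []                                 # nothing measurable; report failure
    p = round(b / a - D)                          # estimated bucket of the heavy coordinate
    return [i for i in S if h[i] == p]


def adaptive_one_sparse_rec(x, rng, max_rounds=64):
    """Return the (0-based) index of the heavy coordinate, or None on failure."""
    S = list(range(len(x)))
    B, delta = 2.0, 0.25
    for _ in range(max_rounds):                   # round bound is a safeguard, not in the paper
        if len(S) <= 1:
            break
        D = min(math.ceil(4 * B * B / delta), 1 << 40)   # cap on D is a safeguard
        S = non_adaptive_shrink(x, S, D, rng)
        B, delta = min(B ** 1.5, 1e6), delta / 2         # cap on B avoids float overflow
    return S[0] if len(S) == 1 else None


# Usage: one heavy entry plus a small amount of noise on every coordinate.
rng = random.Random(2011)
x = [1e-6 * rng.uniform(-1, 1) for _ in range(1000)]
x[417] = 5.0
guess = adaptive_one_sparse_rec(x, rng)
```

Each call to `non_adaptive_shrink` corresponds to one adaptive round with two measurements, so the number of loop iterations is the quantity that Lemma 3.3 bounds by $O(\log\log n)$. When the input is exactly 1-sparse, the bucket estimate is exact and the heavy index is never discarded.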



Proof. Let $C'$ be the constant from Lemma 3.2. Define $B_0 = 2$ and $B_i = B_{i-1}^{3/2}$ for $i \ge 1$. Define $\delta_i = 2^{-i}/4$
for $i \ge 0$. Suppose $C \ge 16 C' B_0^2 / \delta_0^2$.

Define $r = O(\log\log n)$ so $B_r \ge n$. Starting with $S_0 = [n]$, our algorithm iteratively applies Lemma 3.2
with parameters $B = 4B_i$ and $\delta = \delta_i$ to $x_{S_i}$ to identify a set $S_{i+1} \subset S_i$ with $j \in S_{i+1}$, ending when $i = r$.

We prove by induction that Lemma 3.2 applies at the $i$th iteration. We chose $C$ to match the base case.
For the inductive step, suppose $\|x_{S_i \setminus \{j\}}\|_2 \le |x_j| / (C' \cdot 16 B_i^2 / \delta_i^2)$. Then by Lemma 3.2,

\[ \|x_{S_{i+1} \setminus \{j\}}\|_2 \le |x_j| / (C' \cdot 64 B_i^3 / \delta_i^2) = |x_j| / (C' \cdot 16 B_{i+1}^2 / \delta_{i+1}^2) \]

so the lemma applies in the next iteration as well, as desired.

After $r$ iterations, we have $|S_r| \le 1 + n/B_r^2 < 2$, so we have uniquely identified $j \in S_r$. The probability
that any iteration fails is at most $\sum_i \delta_i < 2\delta_0 = 1/2$. □



3.2 k-sparse recovery

Given a 1-sparse recovery algorithm using $m$ measurements, one can use subsampling to build a $k$-sparse
recovery algorithm using $O(km)$ measurements and achieving constant success probability. Our method
for doing so is quite similar to one used in [GLPS10]. The main difference is that, in order to identify one
large coefficient among a subset of coordinates, we use the adaptive algorithm from the previous section as
opposed to error-correcting codes.

For intuition, straightforward subsampling at rate $1/k$ will, with constant probability, recover (say) 90%
of the heavy hitters using $O(km)$ measurements. This reduces the problem to $k/10$-sparse recovery: we can
subsample at rate $10/k$ and recover 90% of the remainder with $O(km/10)$ measurements, and repeat $\log k$
times. The number of measurements decreases geometrically, for $O(km)$ total measurements. Naively
doing this would multiply the failure probability and the approximation error by $\log k$; however, we can
make the number of measurements decay less quickly than the sparsity. This allows the failure probability
and approximation ratios to also decay exponentially so their total remains constant.

To determine the number of rounds, note that the initial set of $O(km)$ measurements can be done in
parallel for each subsampling, so only $O(m)$ rounds are necessary to get the first 90% of heavy hitters.
Repeating $\log k$ times would require $O(m \log k)$ rounds. However, we can actually make the sparsity in
subsequent iterations decay super-exponentially, in fact as a power tower. This gives $O(m \log^* k)$ rounds.
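
As a back-of-the-envelope check of the geometric decay in the intuition above, assume (purely for illustration) that stage $i$ works at sparsity $k/10^i$ and spends $m$ measurements per remaining heavy hitter. The total is then dominated by the first stage:

```latex
\[
  \sum_{i \ge 0} \frac{k}{10^{i}} \, m
  \;=\; k m \sum_{i \ge 0} 10^{-i}
  \;=\; \frac{10}{9} \, k m
  \;=\; O(km).
\]
```

The constants 90% and 10 are the illustrative ones from the text; the actual algorithm uses a different, faster-decaying schedule.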

Theorem 3.4. There exists an adaptive $(1+\epsilon)$-approximate $k$-sparse recovery scheme with $O(\frac{1}{\epsilon} k \log\frac{1}{\delta} \log\log(n\epsilon/k))$
measurements and success probability $1 - \delta$. It uses $O(\log^* k \log\log(n\epsilon))$ rounds.

To prove this, we start from the following lemma: 

Lemma 3.5. We can perform $O(\log\log(n/k))$ adaptive measurements and recover an $\hat{\imath}$ such that, for any
$j \in H_{k,1/k}(x)$ we have $\Pr[\hat{\imath} = j] = \Omega(1/k)$.

Proof. Let $S = H_k(x)$. Let $T \subset [n]$ contain each element independently with probability $p = 1/(4C^2 k)$,
where $C$ is the constant in Lemma 3.3. Let $j \in H_{k,1/k}(x)$. Then we have

\[ E[\|x_{T \setminus S}\|_2^2] = p \|x_{\overline{S}}\|_2^2 \]

so

\[ \|x_{T \setminus S}\|_2 \le \sqrt{4p}\, \|x_{\overline{S}}\|_2 = \frac{1}{C\sqrt{k}} \|x_{\overline{S}}\|_2 \le |x_j| / C \]

with probability at least 3/4. Furthermore we have $E[|T \setminus S|] < pn$ so $|T \setminus S| < n/k$ with probability at
least $1 - 1/(4C^2) > 3/4$. By the union bound, both these events occur with probability at least 1/2.
Independently of this, we have

\[ \Pr[T \cap S = \{j\}] = p(1 - p)^{k-1} > p/e \]

so all these events hold with probability at least $p/(2e)$. Assuming this,

\[ \|x_{T \setminus \{j\}}\|_2 \le |x_j| / C \]

and $|T| \le 1 + n/k$. But then Lemma 3.3 applies, and $O(\log\log |T|) = O(\log\log(n/k))$ measurements can
recover $j$ from a sketch of $x_T$ with probability 1/2. This is independent of the previous probability, for a
total success chance of $p/(4e) = \Omega(1/k)$. □

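Lemma 3.6. With $O(\frac{1}{\epsilon} k \log\frac{1}{f\delta} \log\log(n\epsilon/k))$ adaptive measurements, we can recover $T$ with $|T| \le k$ and

\[ \mathrm{Err}^2(x_{\overline{T}}, fk) \le (1 + \epsilon)\, \mathrm{Err}^2(x, k) \]

with probability at least $1 - \delta$. The number of rounds required is $O(\log\log(n\epsilon/k))$.

Proof. Repeat Lemma 3.5 $m = O(\frac{1}{\epsilon} k \log\frac{1}{f\delta})$ times in parallel with parameters $n$ and $k/\epsilon$ to get coordinates
$T' = \{t_1, t_2, \ldots, t_m\}$. For each $j \in H_{k,\epsilon/k}(x) \subseteq H_{k/\epsilon,\epsilon/k}(x)$ and $i \in [m]$, the lemma implies $\Pr[j = t_i] \ge
\epsilon/(Ck)$ for some constant $C$. Then $\Pr[j \notin T'] \le (1 - \epsilon/(Ck))^m \le e^{-\epsilon m/(Ck)} \le f\delta$ for appropriate $m$.
Thus

\[ E[|H_{k,\epsilon/k}(x) \setminus T'|] \le f\delta\, |H_{k,\epsilon/k}(x)| \le f\delta k \]
\[ \Pr[|H_{k,\epsilon/k}(x) \setminus T'| \ge fk] \le \delta. \]

Now, observe $x_{T'}$ directly and set $T \subseteq T'$ to be the locations of the largest $k$ values. Then, since
$H_{k,\epsilon/k}(x) \subseteq H_k(x)$, $|H_{k,\epsilon/k}(x) \setminus T| = |H_{k,\epsilon/k}(x) \setminus T'| \le fk$ with probability at least $1 - \delta$.

Suppose this occurs, and let $y = x_{\overline{T}}$. Then

\begin{align*}
\mathrm{Err}^2(y, fk) &= \min_{|S| \le fk} \|y_{\overline{S}}\|_2^2 \\
&\le \|y_{\overline{H_{k,\epsilon/k}(x) \setminus T}}\|_2^2 \\
&\le \|x_{\overline{H_{k,\epsilon/k}(x)}}\|_2^2 \\
&= \|x_{\overline{H_k(x)}}\|_2^2 + \|x_{H_k(x) \setminus H_{k,\epsilon/k}(x)}\|_2^2 \\
&\le \|x_{\overline{H_k(x)}}\|_2^2 + k\, \|x_{H_k(x) \setminus H_{k,\epsilon/k}(x)}\|_\infty^2 \\
&\le (1 + \epsilon)\, \|x_{\overline{H_k(x)}}\|_2^2 \\
&= (1 + \epsilon)\, \mathrm{Err}^2(x, k)
\end{align*}

as desired. □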



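procedure AdaptiveKSparseRec($x$, $k$, $\epsilon$, $\delta$)        ▷ Recover approximation $\hat{x}$ of $x$
    $R_0 \leftarrow [n]$
    $\delta_0 \leftarrow \delta/2$, $\epsilon_0 \leftarrow \epsilon/e$, $f_0 \leftarrow 1/32$, $k_0 \leftarrow k$
    $J \leftarrow \{\}$
    for $i \leftarrow 0, \ldots, O(\log^* k)$ do        ▷ While $k_i \ge 1$
        for $t \leftarrow 0, \ldots, \Theta(\frac{1}{\epsilon_i} k_i \log\frac{1}{\delta_i})$ do
            $S_t \leftarrow$ Subsample($R_i$, $\Theta(\epsilon_i/k_i)$)
            $J$.add(AdaptiveOneSparseRec($x_{S_t}$))
        end for
        $R_{i+1} \leftarrow [n] \setminus J$
        $\delta_{i+1} \leftarrow \delta_i/8$
        $\epsilon_{i+1} \leftarrow \epsilon_i/2$
        $f_{i+1} \leftarrow 1/2^{1/(4^{i+1} f_i)}$
        $k_{i+1} \leftarrow k_i f_i$
    end for
    $\hat{x} \leftarrow x_J$        ▷ Direct observation
    return $\hat{x}$
end procedure

Algorithm 3.2: Adaptive k-sparse recovery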

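Theorem 3.7. We can perform $O(\frac{1}{\epsilon} k \log\frac{1}{\delta} \log\log(n\epsilon/k))$ adaptive measurements and recover a set $T$ of
size at most $2k$ with

\[ \|x_{\overline{T}}\|_2 \le (1 + \epsilon)\, \|x_{\overline{H_k(x)}}\|_2 \]

with probability $1 - \delta$. The number of rounds required is $O(\log^* k \log\log(n\epsilon))$.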

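Proof. Define $\delta_i = \frac{\delta}{2 \cdot 2^i}$ and $\epsilon_i = \frac{\epsilon}{e \cdot 2^i}$. Let $f_0 = 1/32$ and $f_i = 2^{-1/(4^i f_{i-1})}$ for $i > 0$, and define
$k_i = k \prod_{j<i} f_j$. Let $R_0 = [n]$.

Let $r = O(\log^* k)$ such that $f_{r-1} < 1/k$. This is possible since $\alpha_i = 1/(4^{i+1} f_i)$ satisfies the recurrence
$\alpha_0 = 8$ and $\alpha_i = 2^{\alpha_{i-1} - 2i - 2} > 2^{\alpha_{i-1}/2}$. Thus $\alpha_{r-1} \ge k$ for $r = O(\log^* k)$ and then $f_{r-1} < 1/\alpha_{r-1} \le
1/k$.

For each round $i = 0, \ldots, r-1$, the algorithm runs Lemma 3.6 on $x_{R_i}$ with parameters $\epsilon_i$, $k_i$, $f_i$, and
$\delta_i$ to get $T_i$. It sets $R_{i+1} = R_i \setminus T_i$ and repeats. At the end, it outputs $T = \cup T_i$.
The total number of measurements is

\begin{align*}
O\Big(\sum_i \frac{1}{\epsilon_i} k_i \log\frac{1}{f_i \delta_i} \log\log(n\epsilon_i/k_i)\Big)
&\le O\Big(\sum_i \frac{2^i (k_i/k) \log(1/f_i)}{\epsilon}\, k\, \big(i + \log\frac{1}{\delta}\big) \log\big(\log(k/k_i) + \log(n\epsilon/k)\big)\Big) \\
&\le O\Big(\frac{1}{\epsilon} k \log\frac{1}{\delta} \log\log(n\epsilon/k) \sum_i 2^i (k_i/k) \log(1/f_i)(i+1) \log\log(k/k_i)\Big)
\end{align*}

using the very crude bounds $i + \log(1/\delta) \le (i+1)\log(1/\delta)$ and $\log(a + b) \le 2 \log a \log b$ for $a, b \ge e$.
But then

\begin{align*}
\sum_i 2^i (k_i/k) \log(1/f_i)(i+1) \log\log(k/k_i) &\le \sum_i 2^i (i+1) f_i \log(1/f_i) \log\log(1/f_i) \\
&\le \sum_i 2^i (i+1)\, O(\sqrt{f_i}) \\
&= O(1)
\end{align*}

since $f_i < O(1/16^i)$, giving $O(\frac{1}{\epsilon} k \log\frac{1}{\delta} \log\log(n\epsilon/k))$ total measurements. The probability that any of
the iterations fail is at most $\sum_i \delta_i < \delta$. The result has size $|T| \le \sum_i k_i \le 2k$. All that remains is the
approximation ratio $\|x_{\overline{T}}\|_2 = \|x_{R_r}\|_2$.
For each $i$, we have

\[ \mathrm{Err}^2(x_{R_{i+1}}, k_{i+1}) = \mathrm{Err}^2(x_{R_i \setminus T_i}, f_i k_i) \le (1 + \epsilon_i)\, \mathrm{Err}^2(x_{R_i}, k_i). \]

Furthermore, $k_r < k f_{r-1} < 1$. Hence

\[ \|x_{R_r}\|_2^2 = \mathrm{Err}^2(x_{R_r}, k_r) \le \Big(\prod_{i=0}^{r-1} (1 + \epsilon_i)\Big) \mathrm{Err}^2(x_{R_0}, k_0) = \Big(\prod_{i=0}^{r-1} (1 + \epsilon_i)\Big) \mathrm{Err}^2(x, k). \]

But $\prod_{i=0}^{r-1} (1 + \epsilon_i) < e^{\sum_i \epsilon_i} < e$, so

\[ \prod_{i=0}^{r-1} (1 + \epsilon_i) < 1 + \sum_i e\,\epsilon_i \le 1 + 2\epsilon \]

and hence

\[ \|x_{\overline{T}}\|_2 = \|x_{R_r}\|_2 \le \sqrt{1 + 2\epsilon}\, \|x_{\overline{H_k(x)}}\|_2 \le (1 + \epsilon)\, \|x_{\overline{H_k(x)}}\|_2 \]

as desired. □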
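
The claim in the proof that $r = O(\log^* k)$ iterations suffice rests on the power-tower growth of the recurrence $\alpha_0 = 8$, $\alpha_i = 2^{\alpha_{i-1} - 2i - 2}$. The following sketch evaluates that recurrence with exact integers; the function names are illustrative, and the stopping rule (first index with $\alpha_i \ge k$) is taken from the proof.

```python
def alpha_sequence(k):
    """Return [alpha_0, ..., alpha_{r-1}], stopping at the first alpha_i >= k."""
    alphas = [8]                                      # alpha_0 = 1 / (4 * f_0) with f_0 = 1/32
    while alphas[-1] < k:
        i = len(alphas)
        alphas.append(2 ** (alphas[-1] - 2 * i - 2))  # alpha_i = 2^(alpha_{i-1} - 2i - 2)
    return alphas


def num_iterations(k):
    """Smallest r with alpha_{r-1} >= k, i.e. the number of outer iterations."""
    return len(alpha_sequence(k))


first_terms = alpha_sequence(1000)   # 8, 16, 1024
```

Because the fourth term is already $2^{1016}$, four iterations cover every sparsity $k$ that could arise in practice, which is the sense in which the number of outer iterations behaves like $\log^* k$.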



Once we find the support $T$, we can observe $x_T$ directly with $O(k)$ measurements to get a $(1 + \epsilon)$-approximate
$k$-sparse recovery scheme, proving Theorem 3.4.

4 Two-round adaptivity 

The algorithms in this section are invariant under permutation. Therefore, for simplicity of notation, the
analysis assumes our vector $x$ is sorted: $|x_1| \ge \ldots \ge |x_n| = 0$.

We are given a 1-round $k$-sparse recovery algorithm for $n$-dimensional vectors $x$ using $m(k, \epsilon, n, \delta)$
measurements with the guarantee that its output $\hat{x}$ satisfies $\|\hat{x} - x\|_p \le (1 + \epsilon) \cdot \|x_{\overline{[k]}}\|_p$ for a $p \in \{1, 2\}$
with probability at least $1 - \delta$. Moreover, suppose its output $\hat{x}$ has support on a set of size $s(k, \epsilon, n, \delta)$. We
show the following black box two-round transformation.

Theorem 4.1. Assume $s(k, \epsilon, n, \delta) = O(k)$. Then there is a 2-round sparse recovery algorithm for
$n$-dimensional vectors $x$, which, in the first round uses $m(k, \epsilon/5, \mathrm{poly}(k/\epsilon), 1/100)$ measurements and in the
second uses $O(k \cdot m(1, 1, n, \Theta(1/k)))$ measurements. It succeeds with constant probability.

Corollary 4.2. For $p = 2$, there is a 2-round sparse recovery algorithm for $n$-dimensional vectors $x$ such
that the total number of measurements is $O(\frac{1}{\epsilon} k \log(k/\epsilon) + k \log(n/k))$.

Proof of Corollary 4.2. In the first round it suffices to use CountSketch with $s(k, \epsilon, n, 1/100) = 2k$, which
holds for any $\epsilon > 0$ [PW11]. We also have that $m(k, \epsilon/5, \mathrm{poly}(k/\epsilon), 1/100) = O(\frac{1}{\epsilon} k \log(k/\epsilon))$. Using
[CCF02, CM06, GI10], in the second round we can set $m(1, 1, n, \Theta(1/k)) = O(\log n)$. The bound
follows by observing that $\frac{1}{\epsilon} k \log(k/\epsilon) + k \log(n) = O(\frac{1}{\epsilon} k \log(k/\epsilon) + k \log(n/k))$. □

Proof of Theorem 4.1. In the first round we perform a dimensionality reduction of the $n$-dimensional input
vector $x$ to a $\mathrm{poly}(k/\epsilon)$-dimensional input vector $y$. We then apply the black box sparse recovery algorithm
on the reduced vector $y$, obtaining a list of $s(k, \epsilon/5, \mathrm{poly}(k/\epsilon), 1/100)$ coordinates, and show for each
coordinate in the list, if we choose the largest preimage for it in $x$, then this list of coordinates can be used
to provide a $1 + \epsilon$ approximation for $x$. In the second round we then identify which heavy coordinates in
$x$ map to those found in the first round, for which it suffices to invoke the black box algorithm with only a
constant approximation. We place the estimated values of the heavy coordinates obtained in the first pass in
the locations of the heavy coordinates obtained in the second pass.

Let $N = \mathrm{poly}(k/\epsilon)$ be determined below. Let $h: [n] \to [N]$ and $\sigma: [n] \to \{-1, 1\}$ be $\Theta(\log N)$-wise
independent random functions. Define the vector $y$ by $y_i = \sum_{j \mid h(j) = i} \sigma(j) x_j$. Let $Y(i)$ be the vector $x$
restricted to coordinates $j \in [n]$ for which $h(j) = i$. Because the algorithm is invariant under permutation
of coordinates of $y$, we may assume for simplicity of notation that $y$ is sorted: $|y_1| \ge \ldots \ge |y_N| = 0$.

We note that such a dimensionality reduction is often used in the streaming literature. For example, the
sketch of [TZ04] for $\ell_2$-norm estimation utilizes such a mapping. A "multishot" version (that uses several
functions $h$) has been used before in the context of sparse recovery [CCF02, CM06] (see [GI10] for an
overview). Here, however, we need to analyze a "single-shot" version.
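
The single-shot hashing step used in the first round can be sketched as follows. This is a minimal illustration, not the paper's construction: it draws $h$ and $\sigma$ fully at random with Python's generator, whereas the analysis only requires $\Theta(\log N)$-wise independence, and the function name `hash_reduce` and its return values are assumptions made for this example (indices are 0-based).

```python
import random


def hash_reduce(x, N, rng):
    """Compute y[i] = sum of sigma(j) * x[j] over all j with h(j) = i."""
    n = len(x)
    h = [rng.randrange(N) for _ in range(n)]          # bucket of each coordinate
    sigma = [rng.choice((-1, 1)) for _ in range(n)]   # random sign of each coordinate
    y = [0.0] * N
    buckets = [[] for _ in range(N)]                  # buckets[i] is the support of Y(i)
    for j in range(n):
        y[h[j]] += sigma[j] * x[j]
        buckets[h[j]].append(j)
    return y, h, sigma, buckets


# Usage: reduce a 40-dimensional vector to 8 buckets.
rng = random.Random(7)
x = [float(v) for v in range(-20, 20)]
y, h, sigma, buckets = hash_reduce(x, 8, rng)
```

Each coordinate of `y` is a single linear measurement of `x`, so a sparse recovery scheme applied to `y` is still a linear sketch of `x`. The lists in `buckets` play the role of the vectors $Y(i)$ that the second round searches.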

Let $p \in \{1, 2\}$, and consider sparse recovery with the $\ell_p/\ell_p$ guarantee. We can assume that $\|x\|_p = 1$.
We need two facts concerning concentration of measure.

Fact 4.3. (see, e.g., Lemma 2 of [KNPW10]) Let $X_1, \ldots, X_n$ be such that $X_i$ has expectation $\mu_i$ and
variance $v_i^2$, and $X_i \le K$ almost surely. Then if the $X_i$ are $\ell$-wise independent for an even integer $\ell \ge 2$,

\[ \Pr\left[ \left| \sum_{i=1}^n X_i - \mu \right| \ge \lambda \right] \le 2^{O(\ell)} \left( (v\sqrt{\ell}/\lambda)^\ell + (K\ell/\lambda)^\ell \right) \]

where $\mu = \sum_i \mu_i$ and $v^2 = \sum_i v_i^2$.

Fact 4.4. (Khintchine inequality) ([Haa82]) For $t \ge 2$, a vector $z$ and a $t$-wise independent random sign
vector $\sigma$ of the same number of dimensions, $E[|\langle z, \sigma \rangle|^t] \le \|z\|_2^t (\sqrt{t})^t$.

We start with a probabilistic lemma. Let $Z(j)$ denote the vector $Y(j)$ with the coordinate $m(j)$ of largest
magnitude removed.

Lemma 4.5. Let $r = O\big(\|x_{\overline{[k]}}\|_p \cdot \frac{\log N}{N^{1/6}}\big)$ and $N$ be sufficiently large. Then with probability $\ge 99/100$,

1. $\forall j \in [N]$, $\|Z(j)\|_p \le r$.

2. $\forall i \in [N^{1/3}]$, $|\sigma(i) \cdot y_{h(i)} - x_i| \le r$,

3. $\|y_{\overline{[k]}}\|_p \le (1 + O(1/\sqrt{N})) \cdot \|x_{\overline{[k]}}\|_p + O(kr)$,

4. $\forall j \in [N]$, if $h^{-1}(j) \cap [N^{1/3}] = \emptyset$, then $|y_j| \le r$,

5. $\forall j \in [N]$, $\|Y(j)\|_0 = O(n/N + \log N)$.
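Proof. We start by defining events $\mathcal{E}$, $\mathcal{F}$ and $\mathcal{G}$ that will be helpful in the analysis, and showing that all of
them are satisfied simultaneously with constant probability.

Event $\mathcal{E}$: Let $\mathcal{E}$ be the event that $h(1), h(2), \ldots, h(N^{1/3})$ are distinct. Then $\Pr_h[\mathcal{E}] \ge 1 - 1/N^{1/3}$.

Event $\mathcal{F}$: Fix $i \in [N]$. Let $Z'$ denote the vector $Y(h(i))$ with the coordinate $i$ removed. Applying Fact 4.4
with $t = \Theta(\log N)$,

\begin{align*}
\Pr_\sigma[|\sigma(i) y_{h(i)} - x_i| \ge 2\sqrt{t} \cdot \|Z(h(i))\|_2] &\le \Pr_\sigma[|\sigma(i) y_{h(i)} - x_i| \ge 2\sqrt{t} \cdot \|Z'\|_2] \\
&\le \Pr_\sigma[|\sigma(i) y_{h(i)} - x_i|^t \ge 2^t (\sqrt{t})^t \cdot \|Z'\|_2^t] \\
&\le \Pr_\sigma[|\sigma(i) y_{h(i)} - x_i|^t \ge 2^t E[|\langle \sigma, Z' \rangle|^t]] \\
&= \Pr_\sigma[|\sigma(i) y_{h(i)} - x_i|^t \ge 2^t E[|\sigma(i) y_{h(i)} - x_i|^t]] \le 1/N^{4/3}.
\end{align*}

Let $\mathcal{F}$ be the event that for all $i \in [N]$, $|\sigma(i) y_{h(i)} - x_i| \le 2\sqrt{t} \cdot \|Z(h(i))\|_2$, so $\Pr_\sigma[\mathcal{F}] \ge 1 - 1/N$.

Event $\mathcal{G}$: Fix $j \in [N]$ and for each $i \in \{N^{1/3} + 1, \ldots, n\}$, let $X_i = |x_i|^p 1_{h(i)=j}$ (i.e., $X_i = |x_i|^p$ if
$h(i) = j$). We apply Fact 4.3 to the $X_i$. In the notation of that fact, $\mu_i = |x_i|^p/N$ and $v_i^2 \le |x_i|^{2p}/N$,
and so $\mu = \|x_{\overline{[N^{1/3}]}}\|_p^p/N$ and $v^2 \le \|x_{\overline{[N^{1/3}]}}\|_{2p}^{2p}/N$. Also, $K = |x_{N^{1/3}+1}|^p$. Function $h$ is $\Theta(\log N)$-wise
independent, so by Fact 4.3,

\[ \Pr\left[ \left| \sum_i X_i - \frac{\|x_{\overline{[N^{1/3}]}}\|_p^p}{N} \right| \ge \lambda \right] \le 2^{O(\ell)} \left( \left( \|x_{\overline{[N^{1/3}]}}\|_{2p}^p \sqrt{\ell} / (\lambda\sqrt{N}) \right)^\ell + \left( |x_{N^{1/3}+1}|^p \ell / \lambda \right)^\ell \right) \]

for any $\lambda > 0$ and an $\ell = \Theta(\log N)$. For $\ell$ large enough, there is a

\[ \lambda = \Theta\left( \|x_{\overline{[N^{1/3}]}}\|_{2p}^p \sqrt{(\log N)/N} + |x_{N^{1/3}+1}|^p \cdot \log N \right) \]

for which this probability is $\le N^{-2}$. Let $\mathcal{G}$ be the event that for all $j \in [N]$, $\|Z(j)\|_p^p \le C\big(\|x_{\overline{[N^{1/3}]}}\|_p^p/N + \lambda\big)$
for some universal constant $C > 0$. Then $\Pr[\mathcal{G} \mid \mathcal{E}] \ge 1 - 1/N$.

By a union bound, $\Pr[\mathcal{E} \wedge \mathcal{F} \wedge \mathcal{G}] \ge 999/1000$ for $N$ sufficiently large.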

We know proceed to proving the five conditions in the lemma statement. In the analysis we assume that 
the event E l\T l\Q holds (i.e., we condition on that event). 

First Condition: This condition follows from the occurrence of Q, and using that [[x ^^^^^gj ||2p < ll^^ jjyi/a] llp» 

and lk jjyi/3j lip < Iklfcyllp' '^^11 [N^l"^ — k + l)|a;^i/3^^i|^ — ll^pijllp- One just needs to make these 
substitutions into the variable A defining Q and show the value r serves as an upper bound (in fact, there is 
a lot of room to spare, e.g., r/ log N is also an upper bound). 

Second Condition: This condition follows from the joint occurrence of £, F, and Q. 
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Third Condition: For the third condition, let y' denote the restriction of y to coordinates in the set 
[N] \ {h{l),h{2), ...,h{k)}. For p = 1 and for any choice of h and a, \\y'\\i < For p = 2, 

the vector y is the sketch of IITZ04II for £2 -estimation. By their analysis, with probability > 999/1000, 
lly'lli ^ (1 + 0(l/\/iV))||a;'|||, where x' is the vector whose support is [n] \ VJ^^^h~^{i) C [n] \ [k]. We 
assume this occurs and add 1/1000 to our error probability. Hence, ||y'||2 < (1 + 0(l/\/iV))||xpjy||2. 

We relate $\|y'\|_p$ to $\|y_{\overline{[k]}}\|_p$. Consider any $j = h(i)$ for an $i \in [k]$ for which $j$ is not
among the top $k$ coordinates of $y$. Call such a $j$ lost. By the first condition of the lemma,
$|\sigma(i) y_j - x_i| \le r$. Since $j$ is not among the top $k$ coordinates of $y$, there is a coordinate $j'$
among the top $k$ coordinates of $y$ for which $j' \notin h([k])$ and $|y_{j'}| \ge |y_j| \ge |x_i| - r$. We call
such a $j'$ a substitute. We can bijectively map substitutes to lost coordinates. It follows that
$\|y_{\overline{[k]}}\|_p \le \|y'\|_p + O(kr) \le (1 + O(1/\sqrt{N}))\|x_{\overline{[k]}}\|_p + O(kr)$.

Fourth Condition: This follows from the joint occurrence of $\mathcal{E}$, $\mathcal{F}$, and $\mathcal{G}$, and using that
$|x_{m(j)}|^p \le \|x_{\overline{[k]}}\|_p^p / (N^{1/3} - k + 1)$ since $m(j) \notin [N^{1/3}]$.

Fifth Condition: For the fifth condition, fix $j \in [N]$. We apply Fact 4.3 where the $X_i$ are indicator
variables for the event $h(i) = j$. Then $E[X_i] = 1/N$ and $\mathrm{Var}[X_i] \le 1/N$. In the notation of Fact 4.3,
$\mu = n/N$, $\sigma^2 \le n/N$, and $K = 1$. Setting $\ell = \Theta(\log N)$ and
$\lambda = \Theta(\log N + \sqrt{(n \log N)/N})$, we have by a union bound that for all $j \in [N]$,
$\|Y(j)\|_0 \le n/N + \Theta(\log N + \sqrt{(n \log N)/N}) = O(n/N + \log N)$,
with probability at least $1 - 1/N$.
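The fifth condition is a balls-in-bins statement: hashing $n$ coordinates into $N$ buckets leaves every bucket with $O(n/N + \log N)$ coordinates. The Python sketch below simulates this under the assumption of a fully random hash (the lemma itself only needs limited independence). The names `bucket_loads` and `load_scale` are illustrative and do not come from the paper.

```python
import math
import random
from collections import Counter


def bucket_loads(n, N, seed=0):
    """Hash coordinates 0..n-1 into N buckets and return the bucket sizes.

    Assumption: a fully random hash stands in for the hash function h.
    """
    rng = random.Random(seed)
    counts = Counter(rng.randrange(N) for _ in range(n))
    return [counts.get(j, 0) for j in range(N)]


def load_scale(n, N):
    """The quantity n/N + log N that the fifth condition uses as its scale."""
    return n / N + math.log(N)
```

For example, comparing `max(bucket_loads(100_000, 1_000))` against `load_scale(100_000, 1_000)` gives a rough feel for the hidden constant in the $O(\cdot)$ bound for one choice of parameters.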

By a union bound, all events jointly occur with probability at least 99/100, which completes the proof. □ 

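Event $\mathcal{F}_1$: Let $\mathcal{F}_1$ be the event that the algorithm returns a vector $\hat{y}$ with
$\|\hat{y} - y\|_p \le (1 + \epsilon/5)\|y_{\overline{[k]}}\|_p$. Then $\Pr[\mathcal{F}_1] \ge 99/100$.
Let $S$ be the support of $\hat{y}$, so $|S| = s(k, \epsilon/5, N, 1/100)$. We condition on $\mathcal{F}_1$.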

In the second round we run the algorithm on $Y(j)$ for each $j \in S$, each using $m(1, 1, \|Y(j)\|_0, \Theta(1/k))$
measurements. Using the fifth condition of Lemma 4.5 we have that $\|Y(j)\|_0 = O(\epsilon n/k + \log(k)/\epsilon)$ for
$N = \mathrm{poly}(k/\epsilon)$ sufficiently large.

For each invocation on a vector $Y(j)$ corresponding to a $j \in S$, the algorithm takes the largest (in
magnitude) coordinate $HH(j)$ in the output vector, breaking ties arbitrarily. We output the vector $\hat{x}$ with
support equal to $T = \{HH(j) \mid j \in S\}$. We assign the value $\sigma(HH(j)) \hat{y}_j$ to $HH(j)$. We have

\[ \|x - \hat{x}\|_p^p = \|(x - \hat{x})_T\|_p^p + \|(x - \hat{x})_{[n] \setminus T}\|_p^p = \|(x - \hat{x})_T\|_p^p + \|x_{[n] \setminus T}\|_p^p. \tag{8} \]

The rest of the analysis is devoted to bounding the RHS of equation (8).
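The split in equation (8) only uses the fact that $\hat{x}$ vanishes outside $T$. A small numeric check, written as a sketch with made-up vectors (the helper names `lp_pow` and `split_error` are ours):

```python
def lp_pow(v, p):
    """Return ||v||_p^p for a list of numbers."""
    return sum(abs(a) ** p for a in v)


def split_error(x, x_hat, T, p):
    """Return (total, on_T, off_T) for the decomposition in equation (8).

    Assumes x_hat is zero outside the index set T, so that the error
    outside T is just the mass of x outside T.
    """
    diff = [a - b for a, b in zip(x, x_hat)]
    on_T = sum(abs(diff[i]) ** p for i in T)
    off_T = sum(abs(x[i]) ** p for i in range(len(x)) if i not in T)
    return lp_pow(diff, p), on_T, off_T
```

With `x = [5.0, -3.0, 0.5, 0.25]`, `x_hat = [4.5, 0.0, 0.0, 0.0]`, `T = {0}` and `p = 2`, the first returned value equals the sum of the other two.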

Lemma 4.6. For $N = \mathrm{poly}(k/\epsilon)$ sufficiently large, conditioned on the events of Lemma 4.5 and $\mathcal{F}_1$,

\\x^n]\T\\l<{l + €/m^^\\l- 

Proof. If $[k] \setminus T = \emptyset$, the lemma follows by definition. Otherwise, if $i \in ([k] \setminus T)$, then $i \in [k]$, and so
by the second condition of Lemma 4.5, $|x_i| \le |y_{h(i)}| + r$. We also use the third condition of Lemma 4.5 to
obtain $\|y_{\overline{[k]}}\|_p \le (1 + O(1/\sqrt{N})) \cdot \|x_{\overline{[k]}}\|_p + O(kr)$. By the triangle inequality,

)l/p / s l/p / \ 1/p 

\i e [k]\T I \i& [N]\S 

$\le k^{1/p} r + (1 + \epsilon/5) \cdot \|y_{\overline{[k]}}\|_p.$

The lemma follows using that r = 0(||xpj||2 • (log N)/N^/^) and = poly(A;/e) is sufficiently large. □ 
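We bound $\|(x - \hat{x})_T\|_p$ using Lemma 4.3, $|S| \le \mathrm{poly}(k/\epsilon)$, and that $N = \mathrm{poly}(k/\epsilon)$ is sufficiently large.

\[
\begin{aligned}
\|(x - \hat{x})_T\|_p
&\le \Big( \sum_{j \in S} |x_{HH(j)} - \sigma(HH(j)) \cdot \hat{y}_j|^p \Big)^{1/p}
 \le \Big( \sum_{j \in S} \big( |y_j - \hat{y}_j| + |\sigma(HH(j)) \cdot x_{HH(j)} - y_j| \big)^p \Big)^{1/p} \\
&\le \Big( \sum_{j \in S} |y_j - \hat{y}_j|^p \Big)^{1/p} + \Big( \sum_{j \in S} |\sigma(HH(j)) \cdot x_{HH(j)} - y_j|^p \Big)^{1/p} \\
&\le (1 + \epsilon/5)\|y_{\overline{[k]}}\|_p + \Big( \sum_{j \in S} |\sigma(HH(j)) \cdot x_{HH(j)} - y_j|^p \Big)^{1/p} \\
&\le (1 + \epsilon/5)(1 + O(1/\sqrt{N}))\|x_{\overline{[k]}}\|_p + O(kr) + \Big( \sum_{j \in S} |\sigma(HH(j)) \cdot x_{HH(j)} - y_j|^p \Big)^{1/p} \\
&\le (1 + \epsilon/4)\|x_{\overline{[k]}}\|_p + \Big( \sum_{j \in S} |\sigma(HH(j)) \cdot x_{HH(j)} - y_j|^p \Big)^{1/p}.
\end{aligned}
\]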

Event $\mathcal{I}$: We condition on the event $\mathcal{I}$ that all second round invocations succeed. Note that $\Pr[\mathcal{I}] \ge 99/100$.
We need the following lemma concerning 1-sparse recovery algorithms.

Lemma 4.7. Let w be a vector of real numbers. Suppose \wi\^ > ^ • \\w\\p. Then for any vector w for 
which \\w — w\\p < 2 • we have \wi\p > | • Moreover, for all j > 1, liVjl^ < 




Proof. \\w — w\\p > \wi — wi\P, so if \wi\P < | • \\w\\p, then \\w — w\\p > — |) \\w\\n = ^ ■ \\w\ 



IP 

IP ~ 10 

On the other hand, HtL'rjTlIp < jq • \\w\\p- This contradicts that \\w — w\\p < 2 • HwtytHp. For the second 



11] I 

part, for j > 1 we have \wj\^ < ^ • \\w\\p. Now, — w\\p > \wj — Wjl^, so if \wj\P > | • \\w\\p, 
then \\w — w\\p > — jq) \\w\\p = ^ " II^IIp- But since H-Wpyllp < Jo ' WMIp' this contradicts that 
\\w — wWp < 2 • IIuiiyjIIp- □ 

It remains to bound $\sum_{j \in S} |\sigma(HH(j)) \cdot x_{HH(j)} - y_j|^p$. We show for every $j \in S$,
$|\sigma(HH(j)) \cdot x_{HH(j)} - y_j|^p$ is small.

Recall that $m(j)$ is the coordinate of $Y(j)$ with the largest magnitude. There are two cases.

Case 1: $m(j) \notin [N^{1/3}]$. In this case observe that $HH(j) \notin [N^{1/3}]$ either, and $h^{-1}(j) \cap [N^{1/3}] = \emptyset$.

It follows by the fourth condition of Lemma l4.5l that \yj\ < r. Notice that Ixhhq-^I^ < < jyvs-l ' 

Bounding $|\sigma(HH(j)) \cdot x_{HH(j)} - y_j|$ by $|x_{HH(j)}| + |y_j|$, it follows for $N = \mathrm{poly}(k/\epsilon)$ large enough that
\a{HH{j)) . XHHU) - yjl" < • lk[fc]llp/|5|)- 

Case 2: $m(j) \in [N^{1/3}]$. If $HH(j) = m(j)$, then $|\sigma(HH(j)) \cdot x_{HH(j)} - y_j| \le r$ by the second
condition of Lemma 4.5, and therefore

\a{HH{j)) ■ XHHU) - Vi^" ^^"^ • \\x^\\p/\S\ 
for N = poly(/c/e) large enough. 
Otherwise, $HH(j) \ne m(j)$. From condition 2 of Lemma 4.5 and $m(j) \in [N^{1/3}]$, it follows that

\[ |\sigma(HH(j)) x_{HH(j)} - y_j| \le |\sigma(HH(j)) x_{HH(j)} - \sigma(m(j)) x_{m(j)}| + |\sigma(m(j)) x_{m(j)} - y_j| \le |x_{HH(j)}| + |x_{m(j)}| + r. \]

Notice that $|x_{HH(j)}| + |x_{m(j)}| \le 2|x_{m(j)}|$ since $m(j)$ is the coordinate of largest magnitude. Now,
conditioned on $\mathcal{I}$, Lemma 4.7 implies that $|x_{m(j)}|^p \le \frac{9}{10} \cdot \|Y(j)\|_p^p$, and hence
$|x_{m(j)}| \le 10^{1/p} \cdot \|Z(j)\|_p$. Finally, by the first condition of Lemma 4.5 we have $\|Z(j)\|_p = O(r)$,
and so $|\sigma(HH(j)) x_{HH(j)} - y_j|^p = O(r^p)$, which as argued above, is small enough for
$N = \mathrm{poly}(k/\epsilon)$ sufficiently large.

The proof of our theorem follows by a union bound over the events that we defined. □ 

5 Adaptively Finding a Duplicate in a Data Stream 

We consider the following FindDuplicate problem. We are given an adversarially ordered stream $S$ of $n$
elements in $\{1, 2, \ldots, n - 1\}$ and the goal is to output an element that occurs at least twice, with probability
at least $1 - \delta$. We seek to minimize the space complexity of such an algorithm. We improve the space
complexity of [JST11] for FindDuplicate from $O(\log^2 n)$ bits to $O(\log n)$ bits, though we use $O(\log \log n)$
passes instead of a single pass. Notice that [JST11] also proves a lower bound of $\Omega(\log^2 n)$ bits for a single
pass.

We use Lemma 3.3 of our multi-pass sparse recovery algorithm:

Fact 5.1. Suppose there exists an $i$ with $|x_i| \ge C \|x_{[n] \setminus \{i\}}\|_2$ for some constant $C$. Then $O(\log \log n)$
adaptive measurements suffice to recover a set $T$ of constant size so that $i \in T$ with probability at least $1/2$.
Further, all adaptive measurements are linear combinations with integer coefficients of magnitude bounded
by $\mathrm{poly}(n)$.

Our algorithm DuplicateFinder for this problem considers the equivalent formulation of FindDuplicate
in which we think of an underlying frequency vector $x \in \{-1, 0, 1, \ldots, n - 1\}^n$. We start by initializing
$x_i = -1$ for all $i$. Each time item $i$ occurs in the stream, we increment its frequency by 1. The task is
therefore to output an $i$ for which $x_i > 0$.
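The frequency-vector reformulation can be illustrated with a few lines of Python. This is only a sketch of the reformulation, not of the streaming algorithm: it stores the whole vector explicitly, and it indexes coordinates by the possible items $1, \ldots, n-1$ (our assumption; a coordinate that never occurs simply stays at $-1$). The function names are hypothetical.

```python
def frequency_vector(stream, n):
    """Return {item: x_item} with x_item = (number of occurrences) - 1.

    Coordinates are the possible items 1..n-1 (an indexing assumption).
    """
    x = {i: -1 for i in range(1, n)}
    for item in stream:
        x[item] += 1
    return x


def duplicates(x):
    """Items with x_i > 0, i.e. items that occur at least twice."""
    return sorted(i for i, v in x.items() if v > 0)
```

For the stream `[3, 1, 4, 1, 5, 2]` with `n = 6`, item 1 is the only coordinate with positive frequency, and with this indexing the entries of `x` sum to 1 because there are $n$ stream elements and $n - 1$ coordinates.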

Theorem 5.2. There is an $O(\log \log n)$-pass, $O(\log n \log 1/\delta)$ bits of space per pass algorithm for solving
the FindDuplicate problem with probability at least $1 - \delta$.

Proof. We describe an algorithm DuplicateFinder which succeeds with probability at least $1/8$. Since it
knows whether or not it succeeds, the probability can be amplified to $1 - \delta$ by $O(\log 1/\delta)$ independent
parallel repetitions. It is easy to see that the pass and space complexity are as claimed, so we prove correctness.

DuplicateFinder($S$)

1. Repeat the following procedure $C = O(1)$ times independently.

(a) Select $O(1)$-wise independent uniform $t_i \in [0, 1]$ for $i \in [n]$.

(b) Let $\epsilon > 0$ be a sufficiently small constant. Let $m = O(\log 1/\epsilon)$.

(c) Let $z^1, \ldots, z^{4m}$ be a pairwise-independent partition of the coordinates of $z$, where $z_i = x_i/t_i$ for
all $i$.

(d) Run algorithm $\mathcal{A}$ independently on vectors $z^1, \ldots, z^{4m}$.

(e) Let $T_1, \ldots, T_{4m}$ be the outputs of algorithm $\mathcal{A}$ on $z^1, \ldots, z^{4m}$, respectively, as per Fact 5.1.

(f) Compute each $x_i$ for $i \in \bigcup_{j=1}^{4m} T_j$ in an extra pass. If there is an $i$ for which $x_i > 0$, then output
$i$.

2. If no coordinate $i$ has been output, then output FAIL.
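The control flow of the steps above can be sketched offline in Python. This is a simulation under strong simplifying assumptions that are ours, not the paper's: the frequency vector is stored explicitly, the $t_i$ and the partition are fully random, and the adaptive recovery routine of Fact 5.1 is replaced by an exact arg-max over each part. The parameter defaults (`reps`, `m`) are arbitrary illustrative values.

```python
import random


def duplicate_finder(stream, n, reps=8, m=4, seed=0):
    """Offline simulation of DuplicateFinder; returns a duplicate or None (FAIL).

    Assumptions: explicit frequency vector, fully random t_i and partition,
    and an exact arg-max standing in for the adaptive recovery algorithm.
    """
    rng = random.Random(seed)
    x = {i: -1 for i in range(1, n)}
    for item in stream:
        x[item] += 1
    parts = 4 * m
    for _ in range(reps):
        # (a) scaling factors; 1 - random() lies in (0, 1], so division is safe
        t = {i: 1.0 - rng.random() for i in x}
        # (c) rescale and split the coordinates into 4m parts
        z = {i: x[i] / t[i] for i in x}
        groups = [[] for _ in range(parts)]
        for i in x:
            groups[rng.randrange(parts)].append(i)
        # (d)-(e) stand-in for the recovery step: heaviest coordinate per part
        candidates = [max(g, key=lambda i: abs(z[i])) for g in groups if g]
        # (f) verify candidates against the true frequencies
        for i in candidates:
            if x[i] > 0:
                return i
    return None
```

Because step (f) checks $x_i > 0$ before answering, any returned item is a genuine duplicate; a `None` result corresponds to FAIL.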



We use the following fact shown in the proof of Lemma 3 of [JST11].

Lemma 5.3. (see first paragraph of Lemma 3 of [JST11]) For a single index $i \in [n]$ and $t$ arbitrary, we
have

p^[II^h;;;mII2 > -^Vm\\x\\i \ti = t] = o(e), 

where $H_m(z)$ denotes the set of $m$ largest (in magnitude) coordinates of $z$.

Suppose $|z_i| \ge \|x\|_1$ for some value of $i$. This happens if $t_i \le |x_i|/\|x\|_1$ and occurs with probability
equal to $|x_i|/\|x\|_1$. Conditioned on this event, by Lemma 5.3 we have that with probability $1 - O(\epsilon)$,

1 ^„ „ 

2 < -^VlT^WxWl- 



- 20 

Suppose $i$ occurs in $z^j$ for some value of $j \in [4m]$. Since the partition is pairwise-independent, the expected
number of $\ell \in H_m(z) \setminus \{i\}$ which occur in $z^j$ is at most $1/4$, and so with probability at least $3/4 - O(\epsilon)$,
the norm of $z^j$ with the coordinate corresponding to coordinate $i$ in $z$ removed is at most



II H.^{z)\\^ - II 11^ - 20 

Since $m$ is a constant, by Fact 5.1, with probability at least $1/2$, $\mathcal{A}$ outputs a set $T$ which contains coordinate
$i$. Hence, with probability at least $3/8 - O(\epsilon)$, if there is an $i$ for which $|z_i| \ge \|x\|_1$, it is found by
DuplicateFinder.

Let $p_i = |x_i|/\|x\|_1$. Then $|z_i| \ge \|x\|_1$ with probability $p_i$. Since $\sum_i x_i \ge 0$, we have $\sum_{i : x_i > 0} p_i \ge 1/2$.
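A small exact computation illustrates the probabilities $p_i = |x_i|/\|x\|_1$ and the claim that the positive coordinates carry at least half of the total mass. The sketch uses exact rationals; the dictionary representation of $x$ and the function names are our own, and the input is assumed to be a nonzero vector.

```python
from fractions import Fraction


def hit_probabilities(x):
    """Return p_i = |x_i| / ||x||_1 for each coordinate of x (a dict).

    Assumes x is not the all-zero vector.
    """
    l1 = sum(abs(v) for v in x.values())
    return {i: Fraction(abs(v), l1) for i, v in x.items()}


def positive_mass(x):
    """Total probability carried by the coordinates with x_i > 0."""
    p = hit_probabilities(x)
    return sum(p[i] for i, v in x.items() if v > 0)
```

For the stream `[2, 2, 2, 3]` over items `{1, 2, 3}` the vector is `{1: -1, 2: 2, 3: 0}`, so $\|x\|_1 = 3$ and the positive coordinates carry mass $2/3$.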

Consider one of the C = 0(1) independent repetitions of step 1. 

For coordinates $i$ for which $x_i > 0$, let $W_i = 1$ if $|z_i| \ge \|x\|_1$ and $W_i = 0$ otherwise, and let
$W = \sum_i W_i$. Then $E[W] \ge 1/2$ and by pairwise-independence, $\mathrm{Var}[W] \le E[W]$.

Let $W'$ be the average of the random variable $W$ over $C$ independent repetitions. Then
$E[W'] = E[W] \ge 1/2$ and $\mathrm{Var}[W'] \le E[W]/C$, and so by Chebyshev's inequality for $C = O(1)$ sufficiently large we
have that with probability at least $1/2$, $W' > 0$, which means that in one of the $C$ repetitions there is a
coordinate $i$ for which $x_i > 0$ and $|z_i| \ge \|x\|_1$.
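The Chebyshev step can be written out explicitly. The display below is a worked version using $\mathrm{Var}[W'] = \mathrm{Var}[W]/C \le E[W]/C$ (independence of the repetitions) and $E[W] \ge 1/2$; the concrete threshold $C \ge 4$ is our own arithmetic and is not a constant quoted from the paper.

```latex
\Pr[W' = 0]
  \le \Pr\bigl[\, |W' - E[W']| \ge E[W'] \,\bigr]
  \le \frac{\mathrm{Var}[W']}{E[W']^2}
  \le \frac{E[W]/C}{E[W]^2}
  = \frac{1}{C \cdot E[W]}
  \le \frac{2}{C}.
```

Since $W' \ge 0$, taking $C \ge 4$ gives $W' > 0$ with probability at least $1/2$.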

Hence, the overall probability of success is at least $1/2 \cdot (3/8 - O(\epsilon)) > 1/8$, for $\epsilon$ sufficiently small.
This completes the proof. □
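The amplification mentioned at the start of the proof can be quantified. Since a run either outputs a verified duplicate or FAIL, $r$ independent runs all fail with probability at most $(7/8)^r$. The helper below (a hypothetical name, assuming exactly the $1/8$ success bound from the proof) computes the smallest such $r$ for a target failure probability $\delta$.

```python
import math


def repetitions_needed(delta, success=1 / 8):
    """Smallest r with (1 - success)^r <= delta, for 0 < delta < 1."""
    return math.ceil(math.log(delta) / math.log(1 - success))
```

The result grows as $\Theta(\log 1/\delta)$, matching the $O(\log 1/\delta)$ factor in the space bound of Theorem 5.2.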